A. The Paradigm Shift

From pattern matching to content creation

Agenda

  • A. Paradigm Shift — GenAI vs Traditional AI
  • B. Hallucinations — When models get it wrong
  • C. Transformer Engine — The architecture behind LLMs
  • D. Attention Mechanism — How models “focus”
  • E. Model Zoo — Choosing the right model type
  • F. Case Studies — Real-world enterprise adoption

Traditional AI vs Generative AI

Traditional ML (Discriminative)

  • Classifies, predicts, labels
  • Trained on labeled datasets
  • Specialized architectures
  • Output: score, label, prediction
Input → Features → Pattern → Label

Generative AI

  • Creates, generates, synthesizes
  • Trained on massive text corpora
  • General-purpose models
  • Output: text, code, images, audio
Prompt → Context → Probability → Novel Output

Key Distinctions

| Aspect | Traditional AI | Generative AI |
|---|---|---|
| Function | Analyze & Classify | Create & Generate |
| Output | Prediction, Score, Label | Text, Code, Image, Audio |
| Training Data | Labeled datasets | Unlabeled text corpora |
| Example | Spam detection, Fraud analysis | Article writing, Code completion |
| Complexity | Specialized architectures | Massive general-purpose models |

Enterprise Task Distribution

The Best of Both Worlds

Production Pattern

In enterprise systems, you often combine both approaches:

  • Traditional ML for fraud detection on transactions
  • Generative AI to explain the decision in natural language for compliance reports
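A minimal sketch of this hybrid pattern — the `fraud_model` and `llm` interfaces are hypothetical stand-ins, not any real library:

```python
def review_transaction(txn, fraud_model, llm):
    """Hybrid pattern: a discriminative model scores the transaction,
    then a generative model explains the decision for compliance.
    fraud_model and llm are hypothetical interfaces used for illustration."""
    score = fraud_model.predict(txn)      # traditional ML: a risk score
    flagged = score > 0.8
    explanation = llm.generate(           # GenAI: natural-language rationale
        f"Explain in plain English why transaction {txn['id']} "
        f"was {'flagged' if flagged else 'cleared'} (risk score {score:.2f})."
    )
    return {"flagged": flagged, "score": score, "explanation": explanation}
```

The discriminative model makes the decision; the generative model only narrates it — which keeps the auditable logic in the classical system.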

“The question isn’t traditional vs generative — it’s knowing when to use each.”

B. Hallucinations

When models confidently get it wrong

How It Actually Works

The AI predicts the next word based on probability.

“The capital of Saudi Arabia is…”

  • Riyadh (99.8%) ✅
  • Jeddah (0.1%)
  • Paris (0.00001%)

It doesn’t “know” geography. It knows that “Riyadh” almost always follows that sentence.
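A minimal sketch of that prediction step: a softmax turns raw scores into the probability distribution above. The logit values here are made up for illustration, not taken from a real model:

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    m = max(logits.values())  # subtract max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Illustrative logits for the next word after
# "The capital of Saudi Arabia is..."
logits = {"Riyadh": 12.0, "Jeddah": 5.0, "Paris": -4.0}
probs = softmax(logits)
print(max(probs, key=probs.get))  # Riyadh dominates
```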

When It Guesses Wrong (Hallucinations)

What happens when it doesn’t know the answer? It guesses.

“According to Article 214 of the new Saudi Investment Law: …”

  • The (45%)
  • Foreign (30%)
  • Capital (15%)

The AI constructs a sentence that sounds legal, but it has never read the actual document.

Controlling the Output

How we steer the model’s creativity

The Probability Distribution

Let’s look at that Saudi Investment Law example again. The model had these probabilities for the next word:

“According to Article 214 of the…”

  • The (45%)
  • Foreign (30%)
  • Capital (15%)
  • Unicorn (0.01%)

The Decision

It doesn’t always have to pick the top one. We use Temperature, Top-K, and Top-P to control how “adventurous” it gets.

Temperature: The “Creativity” Dial

Temperature adjusts the shape of that probability curve.

Low Temp (< 0.5) - Sharpened probabilities

High Temp (> 0.8) - Flattened probabilities
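The dial is simple to see in code: divide the logits by the temperature before the softmax. This is the standard formulation; the logit values are made up:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature before softmax:
    T < 1 sharpens the distribution, T > 1 flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]
cold = softmax_with_temperature(logits, 0.3)  # near-greedy
hot = softmax_with_temperature(logits, 1.5)   # more even spread
print(cold[0] > hot[0])  # True: low temperature boosts the top choice
```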

Top-K: The “Shortlist”

Top-K simply chops off the list after \(K\) items.

“According to Article 214…”

  • The (45%)
  • Foreign (30%)
  • Capital (15%) ← cut off (if K=2)
  • Unicorn (0.01%) ← cut off

Impact

Here, setting \(K=2\) forces it to stick to the very most likely words, preventing weird deviations like “Unicorn”.
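The shortlist fits in a few lines, reusing the probabilities above:

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize to sum to 1."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

probs = {"The": 0.45, "Foreign": 0.30, "Capital": 0.15, "Unicorn": 0.0001}
print(top_k_filter(probs, 2))
# "Capital" and "Unicorn" are dropped; "The" and "Foreign" are renormalized
```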

Top-P (Nucleus): The “Smart” Cutoff

Top-P sums up probabilities until it hits \(P\) (e.g., 0.9).

Scenario A: The Law (Ambiguous)

  • The (45%) + Foreign (30%) + Capital (15%) = 90%
  • Result: All three are kept.

Scenario B: The Capital (Certain)

“The capital of Saudi Arabia is…”

  • Riyadh (99.8%)
  • Result: Hit 90% immediately. Stops there.

Why Top-P wins

It adapts to the context. It’s strict when the answer is clear (Riyadh), and flexible when it’s open-ended (Legal text).
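A sketch of the nucleus cutoff on both scenarios (with a small tolerance so floating-point rounding doesn't miss the threshold):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cumulative += prob
        if cumulative + 1e-9 >= p:  # tolerance for float rounding
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}

law = {"The": 0.45, "Foreign": 0.30, "Capital": 0.15, "Unicorn": 0.0001}
capital = {"Riyadh": 0.998, "Jeddah": 0.001, "Paris": 0.00001}
print(len(top_p_filter(law, 0.9)))      # 3: keeps The, Foreign, Capital
print(len(top_p_filter(capital, 0.9)))  # 1: stops at Riyadh
```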

Cheat Sheet: Configuration

| Parameter | Effect | Recommended Value |
|---|---|---|
| Temperature | Randomness | 0.0 (Fact/Code) – 0.7 (Chat) – 1.0 (Creative) |
| Top-P | Dynamic vocabulary | 0.9 (Standard), 1.0 (Disable) |
| Top-K | Hard vocabulary limit | 40–100 (Standard) |

Best Practice

Usually you tune Temperature and Top-P; leave Top-K at its default.

Summary: Controlling the Chaos

We just saw how to tune the model. But remember: tuning doesn’t change the underlying reality.

Why Hallucinations Still Happen

  1. Probability, not truth: we can sharpen the curve (low temp), but the model still doesn’t know facts.

  2. Plausibility over accuracy: the model wants to sound correct, not be correct.

  3. Training gaps: if it never saw the “Saudi Investment Law”, it will confidently invent it, no matter the temperature.

The Core Issue

It’s not a “bug” — it’s a feature. The same mechanism that writes a new poem also writes a fake law.

Production Impact

Real Liability

A customer-facing chatbot that hallucinates legal or medical advice creates real legal and financial liability for the organization.

  • Code generation → subtle bugs that pass code review
  • Document summarization → phantom claims not in source
  • Customer support → wrong answers delivered confidently
  • Research assistance → fabricated citations and statistics

Mitigation Strategies

| Strategy | How It Works | When to Use |
|---|---|---|
| RAG | Ground responses in retrieved documents | Most enterprise apps (Module 03) |
| Output Validation | Verify against structured data/rules | Factual claims, numbers |
| Temperature Control | Lower temp = less creative = fewer errors | Factual tasks (0.1–0.3) |
| Human-in-the-Loop | Human review before action | Critical decisions |
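To make output validation concrete, here is a toy numeric-claim checker: every number the model states must appear in the grounding text, or the answer is flagged. This is a naive heuristic for illustration, not a production validator; `validate_numeric_claims` is a hypothetical name:

```python
import re

def validate_numeric_claims(answer, source_facts):
    """Toy output validation: every number in the model's answer
    must also appear in the grounding data, else flag for review."""
    claimed = set(re.findall(r"\d+(?:\.\d+)?", answer))
    known = set(re.findall(r"\d+(?:\.\d+)?", source_facts))
    return claimed <= known  # True only if all claimed numbers are grounded

source = "Revenue grew 12% to 3.4 billion SAR in 2024."
print(validate_numeric_claims("Revenue grew 12% in 2024.", source))  # True
print(validate_numeric_claims("Revenue grew 15% in 2024.", source))  # False
```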

Preview

You’ll build a full RAG system in Module 03 — the industry-standard approach to grounding LLM outputs in real data.

C. The Transformer Engine

The architecture that powers modern AI

The Road to Transformers

```mermaid
graph LR
    A["RNNs<br/>(1980s)"] --> B["LSTMs<br/>(1997)"]
    B --> C["Attention<br/>(2014)"]
    C --> D["Transformers<br/>(2017)"]
    D --> E["GPT / BERT<br/>(2018+)"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#9B8EC0,stroke:#1C355E,color:#fff
    style C fill:#00C9A7,stroke:#1C355E,color:#fff
    style D fill:#FF7A5C,stroke:#1C355E,color:#fff
    style E fill:#1C355E,stroke:#1C355E,color:#fff
```

The breakthrough: “Attention Is All You Need” (2017) replaced sequential processing with parallel attention — enabling massive scale-up.

Transformer Architecture (Simplified)

Input Text → Tokenization → Embedding → Positional Encoding
→ Transformer Block (× N): Multi-Head Attention → Feed Forward Network → Layer Normalization
→ Output Projection → Next Token Prediction

Why Transformers Won

RNNs / LSTMs

  • Process tokens one at a time
  • Forget long-range context
  • Hard to parallelize
  • Training: days → weeks

Transformers

  • Process all tokens at once
  • Attend to any position
  • Massively parallel (GPU-friendly)
  • Training: hours → days

Key Insight

Parallelism is what enabled scaling from millions to trillions of parameters — and why modern LLMs require GPU clusters.

D. The Attention Mechanism

How models learn to “focus”

Attention as a Highlighter

Imagine reading a long document and highlighting the most relevant parts for a question.

Without attention:

The model treats every word equally — like reading the entire book for every question.

With attention:

The model learns which words matter most for the current prediction — like a skilled researcher scanning for key passages.

Query, Key, Value — The Library Analogy

Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V

The Library Analogy:

  • Q (Query) — Your research question
  • K (Key) — Book titles on the shelf
  • V (Value) — The actual content inside

How it works:

  1. Compare your query to every key
  2. Score how relevant each book is
  3. Read more from high-scoring values
  4. Blend information proportionally

Don’t memorize the formula — understand the intuition.
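The four steps above can be sketched in plain Python for a single query vector — a toy illustration of the formula, not an optimized implementation:

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention for one query.
    Q: query vector; K, V: lists of key/value vectors."""
    d_k = len(Q)
    # 1-2. Compare the query to every key and score relevance
    scores = [sum(q * k for q, k in zip(Q, key)) / math.sqrt(d_k) for key in K]
    # Softmax scores into attention weights
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # 3-4. Blend the values proportionally to the weights
    dim = len(V[0])
    return [sum(w * v[i] for w, v in zip(weights, V)) for i in range(dim)]

# Toy example: the query matches the first key, so the output
# leans toward the first value vector.
Q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
print(out)  # first component larger than the second
```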

Attention Visualization

Attention weights showing how “it” attends to “animal” — the model learned this relationship from data

Multi-Head Attention

Instead of one attention mechanism, transformers use 8–128 parallel heads, each learning different relationship types:

| Head | What It Learns | Example |
|---|---|---|
| Head 1 | Grammar | subject ↔ verb agreement |
| Head 2 | Semantics | synonyms, antonyms |
| Head 3 | Long-range | pronoun ↔ antecedent |
| Head 4 | Position | nearby word relationships |

Why Multiple Heads?

One head might track grammar while another tracks meaning. Combined, they give the model a rich, multi-dimensional understanding of each token’s context.

Computational Cost: O(n²)

The Quadratic Problem

Memory scales with sequence length squared. Every token must attend to every other token.

| Context Length | Attention Computations |
|---|---|
| 1K tokens | ~1 million |
| 8K tokens | ~64 million |
| 128K tokens | ~16 billion |

This is why context window choices have real cost and latency implications — we’ll explore this in the lab.
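The quadratic growth is simple arithmetic, since each of n tokens scores against all n tokens:

```python
# Quadratic attention cost: every token attends to every other token,
# so pairwise score computations grow as n^2.
for n in (1_000, 8_000, 128_000):
    print(f"{n:>7} tokens -> {n * n:,} attention scores")
```

Going from 8K to 128K context (16× more tokens) means 256× more attention computations, which is where the cost and latency come from.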

E. The Model Zoo

Choosing the right model for the job

The Model Spectrum

```mermaid
graph LR
    A["Base Model<br/>(raw completion)"] --> B["Instruction-Tuned<br/>(follows commands)"]
    B --> C["Chat-Optimized<br/>(multi-turn dialogue)"]
    C --> D["Domain-Specialized<br/>(expert knowledge)"]

    style A fill:#9B8EC0,stroke:#1C355E,color:#fff
    style B fill:#00C9A7,stroke:#1C355E,color:#fff
    style C fill:#FF7A5C,stroke:#1C355E,color:#fff
    style D fill:#1C355E,stroke:#1C355E,color:#fff
```

Base vs Instruct vs Chat

Base Models

  • GPT-3 base, LLaMA base
  • Complete text patterns
  • No instruction following
  • Best for: creative writing, code gen
```python
prompt = "Once upon a time"
# Continues the story...
```

Instruction-Tuned

  • InstructGPT, Alpaca
  • Follow specific commands
  • Single-turn tasks
  • Best for: summarization, extraction
```python
prompt = "Summarize this in 3 bullet points: ..."
```

Chat-Optimized

  • GPT-4, Claude, Gemini
  • Multi-turn conversation
  • Safety-aligned
  • Best for: assistants, chatbots
```python
messages = [
    {"role": "system", ...},
    {"role": "user", ...},
]
```

Model Capabilities Comparison

Domain-Specialized Models

| Domain | Model | Specialization |
|---|---|---|
| Code | Codex, CodeLlama | Programming languages, debugging |
| Medicine | Med-PaLM, BioGPT | Clinical text, medical QA |
| Finance | BloombergGPT | Financial analysis, market data |
| Legal | Harvey AI | Contract review, legal research |

Choosing Guide

  • User-facing app? → Chat-optimized (safety + multi-turn)
  • Batch processing? → Instruction-tuned (faster, cheaper)
  • Specific domain? → Domain-specialized or fine-tuned
  • Prototyping? → Start with the best general model, specialize later

F. Case Studies & Wrap-up

GenAI in the real world

Real-World Case Studies

GitHub Copilot

  • Code-specialized model
  • IDE integration
  • ~30% suggestion acceptance
  • Lesson: Specialized > general for domain tasks

ChatGPT Enterprise

  • Chat-optimized + enterprise
  • 128K context window
  • SOC 2 compliant
  • Lesson: Security & compliance > raw capability

Jasper.ai

  • Instruction-tuned for marketing
  • Brand voice consistency
  • SaaS pricing model
  • Lesson: Vertical AI commands premium pricing

Enterprise Adoption Matrix

Source: McKinsey, Gartner

Key Takeaways

  1. GenAI creates — traditional AI classifies. Use both together.
  2. Transformers enable parallelism → massive scale-up since 2017.
  3. Attention lets models focus on what matters — but costs O(n²).
  4. Choose your model type based on the task: base, instruct, chat, or specialized.
  5. Hallucinations are inevitable — always have a mitigation strategy (RAG, validation, human review).
  6. Enterprise adoption is accelerating — but security and compliance drive decisions, not just capability.

Up Next

Lab 1: Hands-on tokenization and cost analysis — see how tokens, context windows, and costs connect in practice.